Integration of pre-treatment computational radiomics, deep radiomics, and transcriptomics enhances soft-tissue sarcoma patient prognosis

Our objective was to capture subgroups of soft-tissue sarcoma (STS) using handcraft and deep radiomics approaches to understand their relationship with histopathology, gene-expression profiles, and metastatic relapse-free survival (MFS). We included all consecutive adults with newly diagnosed locally advanced STS (N = 225, 120 men, median age: 62 years) managed at our sarcoma reference center between 2008 and 2020, with contrast-enhanced baseline MRI. After MRI postprocessing, segmentation, and reproducibility assessment, 175 handcrafted radiomics features (h-RFs) were calculated. Convolutional autoencoder neural network (CAE) and half-supervised CAE (HSCAE) were trained in repeated cross-validation on representative contrast-enhanced slices to extract 1024 deep radiomics features (d-RFs). Gene-expression levels were calculated following RNA sequencing (RNAseq) of 110 untreated samples from the same cohort. Unsupervised classifications based on h-RFs, CAE, HSCAE, and RNAseq were built. The h-RFs, CAE, and HSCAE grouping were not associated with the transcriptomics groups but with prognostic radiological features known to correlate with lower survivals and higher grade and SARCULATOR groups (a validated prognostic clinical-histological nomogram). HSCAE and h-RF groups were also associated with MFS in multivariable Cox regressions. Combining HSCAE and transcriptomics groups significantly improved the prognostic performances compared to each group alone, according to the concordance index. The combined radiomic-transcriptomic group with worse MFS was characterized by the up-regulation of 707 genes and 292 genesets related to inflammation, hypoxia, apoptosis, and cell differentiation. Overall, subgroups of STS identified on pre-treatment MRI using handcrafted and deep radiomics were associated with meaningful clinical, histological, and radiological characteristics, and could strengthen the prognostic value of transcriptomics signatures.


Supplementary Table
. Survival analysis in the subcohort of n= 110 patients with both radiomics and transcriptomics data: assessment of the added value of the HSCAE transcriptomics groups to the SARCULATOR groups Supplementary Table ST8.Gene and Pathway enrichment analysis between the CINSARC gene signature and the DGE list of genes identified in our study.
We wanted to determine the intersection between the list of genes of Cinsarc and the list of gene identified by the DGE analysis in our study.
We found that when both Fold Change ≥ 2 and adjusted P-value of 0.05 are enforced, 25/67 (37%) CINSARC genes were found to be significantly expressed in the differential gene expression analysis of the A RNA × A HSCAE versus Other groups.
In addition, when we enforced only the P-value significance constraint but not the fold change, 54/67 (80%) CINSARC genes have their reproducibility confirmed in the RnaSeq experiment.Also, the CPDB algorithm 41 ,was used to generate pathway enrichment on both the list of 67 genes of CINSARC 22 , and the list of 1230 DGE genes found in the comparison between the A RNA × A HSCAE versus 'Other' groups of our study.

P-value
Seventeen pathways were found in common between the two analyses covering molecular biology functionalities of mitosis control and chromosome integrity.In the original study by Chibon et al. 22 , the authors came to similar conclusion concerning the biological interpretation of the CINSARC signature.
Overall, this analysis emphasizes that the DGE analysis between the A RNA × A HSCAE versus the Other groups is coherent with the genes and functionalities of the CINSARC signature.This package also detected the proportion of nucleotide fragments whose R1 and R2 paired end reads were overlapping and merged them into single-end reads.
To prevent double coverage bias due to overlapping R1 and R2 sequences and keep exploiting those fragments, a home-made python script was developed to split those merged reads into new non-overlapping R1 and R2 paired-end reads.Alignment of quality controlled DNA sequences to .bamfiles was performed via Bowtie2 (1) with the '-very-sensitive' alignment strategy parameter.Alignment of RNA sequences was performed using Tophat2 (2) and Bowtie2 (1) on both the UCSC hg19 reference genome and transcriptome.Remaining PCR duplicates reads in Post-alignment .bamfiles were removed via the PicardTools suite MarkDuplicates module (PicardTools -https://broadinstitute.github.io/picard/).

Normalization of Gene Expression data.
ComBat harmonization method was applied to correct for batch effect 31 between the 54 patients with frozen samples and the 56 patients with FFPE samples.We used the Voom method to normalize the trancript counts of our samples.This is a method that transforms count data to log2-counts per million (logCPM), then estimates the mean-variance relationship and finally uses this relationship to compute appropriate observationnal-level weights.
* Supplementary Method M3.Gene-expression profiles of the radiomics groups.
The custom code used to generate the results presented in this paper is available from the corresponding author upon reasonable request.
Differential gene expression (DGE) and Geneset enrichment analyses between the radiomics groups determined by unsupervised clustering were performed as follows.

Differential analysis.
The RNA-Seq differential gene expression between defined groups was performed using the statistical t-test from the 'LIMMA' R package.It calculates fold change and nominal P-values related to each gene starting from raw expression values and the normalization weights produced by the 'VOOM' R package.The set of nominal P-values were adjusted using Benjamini-Hochberg procedure.The P-value threshold cut-off to discriminate significant up/down regulated transcripts was set to 0.05 and the fold change was set to 2.0 40 .

Geneset enrichment analysis.
We assessed enrichment in biological pathways between RNA-Sequencing groups based on Broad Institute's Molecular Signature Data Base gene sets (MSigDB) and the CYBERSORT LM22 immuno gene sets.LM22 is a signature matrix file consisting of 547 genes that accurately distinguish 22 mature human hematopoietic populations isolated from peripheral blood or in vitro culture conditions, including seven T cell types, naïve and memory B cells, plasma cells, NK cells, and myeloid subsets 33,34 .Gene-set enrichment was performed as follows: the total count of genes (UNIVERSE) was set to the total number of genes for which the Differential Gene Expression (DGE) calculation was performed (18,399 annotated genes in our study).For each biological pathway we counted the genes that participated in the pathway (IN.GENESET) and its complement with respect to the UNIVERSE (NOT.IN.GENESET).Also the UNIVERSE was split in differentially expressed genes (SIG.DGE) and their complement (NOT.SIG.DGE).Then we built a 2 x 2 contingency matrix holding the count of differentially expressed genes in the pathway (SIG.The raw P-values of all tests for each pathway were adjusted.We additionally performed the calculation of the odds ratio (OR) based on the above contingency table and its 95% confidence interval (CI, results of the fisher.testfunction R base package).We tagged as significantly ENRICHED all pathways with a significant hypergeometric test at adjusted Pvalue ≤ 0.05 and the lower bound of the 95%CI of their OR not lower than 1.We ranked all ENRICHED-tagged pathways by the count of differentially expressed genes in the pathway (SIG.DGE.IN.GENSET).Then we wanted to qualitatively assess pathways significantly overexpressed or under-expressed in sample groups.To that extent, and for each pathway and each group, we calculate the ratio between the numbers of genes that were over-expressed in the pathway and the numbers of DGE genes in the pathway.A ratio close to 0 suggests that the pathway was under-expressed while a ratio close to 1 suggested that the pathway was over-expressed in the reference group.Ratios that were close to 0.5 were considered informative of no enrichment in either group.

PAMr analysis.
Prediction Analysis for Microarrays (PAM) provides a method for class prediction from gene expression profiling, based on an enhancement of the simple nearest prototype (centroid) classifier 35 .The 'nearest shrunken centroids' identifies subsets of genes that best characterize each class.Briefly, we provided to PAMr the full expression matrix normalized as previously described in Supplementary methods.Each sample was assigned to a specific class of interest to determine a subset of discriminant genes for binary classification.The signature was chosen based on PAM shrunken fold change threshold of 2.8 and FDR of 0.05 that guaranteed a mean binary class cross-validation error of 10% (cross-validation by 100 random sample reshuffling iterations)
DGE.IN.GENESET) and its complement (SIG.DGE.NOT.IN.GENESET) and the count of not differentially expressed genes in the pathway (NOT.SIG.DGE.IN.GENESET) and its complement (NOT.SIG.DGE.NOT.IN.GENESET).We performed the hypergeometric test on this table via fisher test in R.